library(sf)
library(tidyverse)
We begin by loading the aggravated burglary data for 2023 and removing records that are missing location coordinates.
burglaries <- read_csv("../data/burglaries_2023.csv")
burglaries_clean <- burglaries |>
filter(!is.na(latitude), !is.na(longitude))
Convert to an sf object
We convert the cleaned data into a spatial object using longitude and latitude.
burglaries_sf <- st_as_sf(
burglaries_clean,
coords = c("longitude", "latitude"),
crs = 4326,
remove = FALSE
)
burglaries_sf
Simple feature collection with 1146 features and 29 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -92.51 ymin: 34.15 xmax: -86.557 ymax: 36.34
Geodetic CRS: WGS 84
Read the census data
census <- read_csv("../data/census.csv")
census
census <- census |>
mutate(
median_income = if_else(median_income < 0, NA_real_, median_income)
)
When I started plotting the census data, I noticed that the
median_income values looked incorrect. Some tracts
had extremely large negative numbers (for example, –666,666,666).
These are not real incomes—they are special codes the Census uses to
show missing data.
To fix this and get accurate plots and results, I changed all
negative income values to NA before continuing the
analysis.
Read the DC shape file data
dc_tracts <- st_read("../data/DC/DC.shp")
Reading layer `DC' from data source `C:\Users\dhans\Documents\DataScience\Program\NSS_projects\geospatial-r-dhansiv-stack\data\DC\DC.shp' using driver `ESRI Shapefile'
Simple feature collection with 174 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.0547 ymin: 35.96778 xmax: -86.51559 ymax: 36.4055
Geodetic CRS: NAD83
Before we can perform a spatial join, we need to make sure both datasets use the same coordinate reference system (CRS). The burglary points are in WGS84 (EPSG:4326), while the census tracts use NAD83 (EPSG:4269). We transform the burglary data to match the tract CRS.
burglaries_sf_4269 <- st_transform(burglaries_sf, st_crs(dc_tracts))
Now that both datasets use the same CRS, we can perform a spatial join. We use st_join() with st_within() to determine which census tract each burglary point falls inside. This attaches the tract information (such as the GEOID) to every burglary incident.
burg_with_tract <- st_join(
burglaries_sf_4269,
dc_tracts,
join = st_within
)
burg_with_tract
Simple feature collection with 1146 features and 41 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -92.51 ymin: 34.15 xmax: -86.557 ymax: 36.34
Geodetic CRS: NAD83
Before we can summarize the number of burglaries in each census tract, we need to remove the geometry so that the data behaves like a regular table. Each incident may appear multiple times (for example, if there are multiple victims), so we also remove duplicate incident numbers. After cleaning, we count the number of unique burglary incidents in each census tract.
burg_nogeome <- burg_with_tract |> st_drop_geometry()
burg_unique <- burg_nogeome |>
distinct(incident_number, GEOID, .keep_all = TRUE)
burg_counts <- burg_unique |>
count(GEOID, name = "burglaries")
head(burg_counts)
Create GEOID in the census data
We construct a GEOID field in the census data by combining the state, county, and tract codes in the same format used in the tract shapefile.
census <- census |>
mutate(
state_chr = sprintf("%02.0f", state),
GEOID = paste0(state_chr, county, tract)
)
head(census)
Because the GEOID requires a text string, the numeric state code (47) is converted into a two-digit string (‘47’) using the sprintf() function.
census_burg <- census |>
left_join(burg_counts, by = "GEOID") |>
mutate(
burglaries = replace_na(burglaries, 0)
)
head(census_burg)
Finally, we join the census and burglary attributes back to the tract shapefile so that we have a complete spatial dataset for mapping and modeling.
tract_data_sf <- dc_tracts |> left_join(census_burg, by = "GEOID")
head(tract_data_sf)
Simple feature collection with 6 features and 20 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -86.97461 ymin: 36.16132 xmax: -86.6693 ymax: 36.4055
Geodetic CRS: NAD83
STATEFP COUNTYFP TRACTCE GEOID NAME.x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON NAME.y
1 47 037 010103 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 Census Tract 101.03, Davidson County, Tennessee
2 47 037 010202 47037010202 102.02 Census Tract 102.02 G5020 S 68394934 77571 +36.3619781 -086.7746355 Census Tract 102.02, Davidson County, Tennessee
3 47 037 010104 47037010104 101.04 Census Tract 101.04 G5020 S 65057849 251504 +36.2940028 -086.8777483 Census Tract 101.04, Davidson County, Tennessee
4 47 037 010106 47037010106 101.06 Census Tract 101.06 G5020 S 21616474 6845 +36.2610013 -086.8023491 Census Tract 101.06, Davidson County, Tennessee
5 47 037 010602 47037010602 106.02 Census Tract 106.02 G5020 S 3097282 0 +36.2533653 -086.6817552 Census Tract 106.02, Davidson County, Tennessee
6 47 037 019200 47037019200 192 Census Tract 192 G5020 S 1771200 0 +36.1711529 -086.7520191 Census Tract 192, Davidson County, Tennessee
state county tract population median_income state_chr burglaries geometry
1 47 037 010103 2411 60000 47 2 MULTIPOLYGON (((-86.91752 3...
2 47 037 010202 3919 81695 47 1 MULTIPOLYGON (((-86.82483 3...
3 47 037 010104 3002 84831 47 5 MULTIPOLYGON (((-86.9744 36...
4 47 037 010106 2948 66940 47 2 MULTIPOLYGON (((-86.83089 3...
5 47 037 010602 3688 49173 47 3 MULTIPOLYGON (((-86.6953 36...
6 47 037 019200 3853 64760 47 2 MULTIPOLYGON (((-86.76277 3...
tract_data_sf <- tract_data_sf |>
rename(NAME = NAME.x) |>
select(-NAME.y)
tract_data_sf
Simple feature collection with 174 features and 19 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.0547 ymin: 35.96778 xmax: -86.51559 ymax: 36.4055
Geodetic CRS: NAD83
First 10 features:
STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON state county tract population median_income
1 47 037 010103 47037010103 101.03 Census Tract 101.03 G5020 S 48034082 61097 +36.3444054 -086.8608396 47 037 010103 2411 60000
2 47 037 010202 47037010202 102.02 Census Tract 102.02 G5020 S 68394934 77571 +36.3619781 -086.7746355 47 037 010202 3919 81695
3 47 037 010104 47037010104 101.04 Census Tract 101.04 G5020 S 65057849 251504 +36.2940028 -086.8777483 47 037 010104 3002 84831
4 47 037 010106 47037010106 101.06 Census Tract 101.06 G5020 S 21616474 6845 +36.2610013 -086.8023491 47 037 010106 2948 66940
5 47 037 010602 47037010602 106.02 Census Tract 106.02 G5020 S 3097282 0 +36.2533653 -086.6817552 47 037 010602 3688 49173
6 47 037 019200 47037019200 192 Census Tract 192 G5020 S 1771200 0 +36.1711529 -086.7520191 47 037 019200 3853 64760
7 47 037 015637 47037015637 156.37 Census Tract 156.37 G5020 S 9828657 1204178 +36.1045801 -086.6411995 47 037 015637 3768 52821
8 47 037 015633 47037015633 156.33 Census Tract 156.33 G5020 S 13138415 4907139 +36.1451761 -086.5766361 47 037 015633 4704 108883
9 47 037 015636 47037015636 156.36 Census Tract 156.36 G5020 S 2194409 0 +36.0833835 -086.6377772 47 037 015636 2979 73529
10 47 037 019119 47037019119 191.19 Census Tract 191.19 G5020 S 25041367 0 +35.9956858 -086.6512255 47 037 019119 7406 104307
state_chr burglaries geometry
1 47 2 MULTIPOLYGON (((-86.91752 3...
2 47 1 MULTIPOLYGON (((-86.82483 3...
3 47 5 MULTIPOLYGON (((-86.9744 36...
4 47 2 MULTIPOLYGON (((-86.83089 3...
5 47 3 MULTIPOLYGON (((-86.6953 36...
6 47 2 MULTIPOLYGON (((-86.76277 3...
7 47 10 MULTIPOLYGON (((-86.66648 3...
8 47 1 MULTIPOLYGON (((-86.61354 3...
9 47 2 MULTIPOLYGON (((-86.64844 3...
10 47 6 MULTIPOLYGON (((-86.70088 3...
Perform some exploratory analysis on your prepared dataset.
Using the tract-level dataset prepared in Part 1 (which aggregates unique burglary incidents by census tract to avoid double-counting), we now explore patterns in burglary counts and rates.
Which census tract had the highest number of burglaries?
top_burglary_count <- tract_data_sf |>
arrange(desc(burglaries)) |>
select(GEOID, NAME, burglaries) |>
slice(1)
top_burglary_count
Simple feature collection with 1 feature and 3 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -86.77516 ymin: 36.13758 xmax: -86.75074 ymax: 36.15478
Geodetic CRS: NAD83
GEOID NAME burglaries geometry
1 47037016000 160 39 MULTIPOLYGON (((-86.77516 3...
The census tract with the highest number of aggravated burglaries in 2023 is Tract 160 (GEOID 47037016000), with a total of 39 incidents.
Which census tract had the highest number of burglaries per 1000 residents?
tract_data_sf <- tract_data_sf |>
mutate(
burglary_rate_1000 = (burglaries / population) * 1000
)
top_burglary_rate <- tract_data_sf |>
filter(population > 0) |>
arrange(desc(burglary_rate_1000)) |>
select(GEOID, NAME, burglary_rate_1000, population, burglaries) |>
slice(1)
top_burglary_rate
Simple feature collection with 1 feature and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -86.77516 ymin: 36.13758 xmax: -86.75074 ymax: 36.15478
Geodetic CRS: NAD83
GEOID NAME burglary_rate_1000 population burglaries geometry
1 47037016000 160 15.1751 2570 39 MULTIPOLYGON (((-86.77516 3...
The census tract with the highest burglary rate per 1,000 residents is Tract 160 (GEOID 47037016000). This tract reported 39 aggravated burglaries and has a population of 2,570, resulting in a burglary rate of approximately 15.18 incidents per 1,000 residents.
We’re interested in the relationship between median income and number of aggravated burglaries, so examine those variables on their own and together to see what you can find. You may want to perform additional calculations, create plots, etc.
Distribution of median income across census tracts
tract_data_sf |>
filter(!is.na(median_income)) |> # remove missing incomes
ggplot(aes(x = median_income)) +
geom_histogram(bins = 30, fill = "steelblue") +
labs(
title = "Distribution of Median Income by Census Tract",
x = "Median Income",
y = "Count of Tracts"
) +
theme_minimal()
The median income varies across census tracts in Davidson County, with most tracts falling between about $40,000 and $80,000. After removing invalid negative values from the Census data, the distribution shows a right-skewed pattern, with a few higher-income tracts on the upper end.
Distribution of aggravated burglaries across census tracts
ggplot(tract_data_sf, aes(x = burglaries)) +
geom_histogram(bins = 30, fill = "firebrick") +
labs(
title = "Distribution of Aggravated Burglaries",
x = "Number of Burglaries",
y = "Count of Census Tracts"
) +
theme_minimal()
Most census tracts experienced relatively few aggravated burglaries in 2023, with the majority falling between 0 and 10 incidents. A small number of tracts had substantially higher counts, creating a right-skewed distribution. This suggests that aggravated burglaries are concentrated in specific areas rather than evenly distributed across the county.
Relationship Between Median Income and Aggravated Burglary Counts
tract_data_sf |>
filter(!is.na(median_income)) |> # remove missing incomes
ggplot(aes(x = median_income, y = burglaries)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", formula = y~x, se = FALSE, color = "blue") +
labs(
title = "Median Income vs. Aggravated Burglary Counts",
x = "Median Income",
y = "Number of Burglaries"
) +
theme_minimal()
To explore whether income is related to aggravated burglary activity, I created a scatterplot comparing median income (x-axis) with the number of burglaries in each census tract (y-axis). Each point represents a census tract, and a linear trend line is included to show the overall direction of the relationship.
Fit a Poisson regression model with target variable the rate of burglaries per census tract and with predictor the median income. Offset using the log of the population so that we are looking at the rate of burglaries per population instead of the number of burglaries. How can you interpret the meaning of the output? How do the estimates from the model compare to the observed data?
model1 <- tract_data_sf |>
st_drop_geometry() |>
filter(
population>0,
!is.na(median_income),
!is.na(burglaries)
)
Poisson regression model fit
model2 <- glm(
burglaries ~ median_income + offset(log(population)),
data = model1,
family =poisson()
)
summary(model2)
Call:
glm(formula = burglaries ~ median_income + offset(log(population)),
family = poisson(), data = model1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.184e+00 9.778e-02 -53.01 <2e-16 ***
median_income -2.434e-05 1.686e-06 -14.43 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 838.73 on 169 degrees of freedom
Residual deviance: 577.53 on 168 degrees of freedom
AIC: 1089.4
Number of Fisher Scoring iterations: 5
Interpret the coefficient on median_income
model2 |> coef() |> exp()
(Intercept) median_income
0.00560792 0.99997566
Income is measured in dollars, so the effect per $1 is too small to interpret. I scaled the coefficient to a $10,000 increase, which makes the rate ratio meaningful while still using the same exponentiation approach taught in class.
exp(10000 * coef(model2)["median_income"])
median_income
0.7839851
For two census tracts with the same population, the tract with a $10,000 higher median income is estimated to have a burglary rate that is about 21.6% lower.
predict(
model2,
newdata = tibble(
median_income = c(30000, 80000),
population = c(4000, 4000)
),
type = "response"
)
1 2
10.808993 3.201285
For two census tracts with the same population of 4,000 people, the model predicts about 10.8 aggravated burglaries in a tract with a median income of $30,000, compared to about 3.2 burglaries in a tract with a median income of $80,000. This reflects the model’s estimate that higher-income areas have substantially lower burglary rates.